Introduction
The representation of women in media has long been a topic of interest, as it reflects societal norms and attitudes towards gender equality. Despite the progress made in recent decades towards gender equality, it is important to examine whether these changes are reflected in the films we watch. Movies provide a unique insight into the subconscious ways in which society is conditioned to view women, and can capture the ideals and norms of the time in which they were produced.
In this data analysis project, we use the CMU Movie Summary Corpus dataset, complemented by additional datasets, to explore the portrayal of women in film. We analyze not only the roles of actresses and characters, but also those of writers and directors. By analyzing these factors, we aim to gain a deeper understanding of how women are and have been represented in media, and how this has evolved over time.
The Data
Our analysis is based on merging the CMU Movie Summary Corpus dataset, the Stanford CoreNLP-processed summaries, IMDb, Wikidata, and Box Office Mojo. We have separated the data into three tables: the movies table, the characters table, and the directors and writers table.
- The movies table contains titles, release year, runtime, box office revenue, average rating and number of votes on IMDb, genre, as well as the list of directors and writers. There are a total of 81,741 different movies.
- The characters table contains the ID of the movie, the character name and actor name, their height, ethnicity, birth and death year, the movie metric and the actor metric. There are 450,669 characters played by 135,761 different actors.
- The directors and writers table contains titles, role (either director or writer), name, gender, birth year, and height. There are a total of 86,474 directors and 164,271 writers.
The Impact Score metric
Movies
We have created a metric that measures the impact of a movie based on its average rating and its number of votes. Our assumption is that an impactful movie has a lot of votes and either an extremely good or an extremely bad average rating.
We apply a logarithmic transformation to the number of votes to turn its heavy-tailed distribution into an approximately Gaussian one, then normalize it so that different movies can be compared on the same scale. We also normalize the average rating and take its absolute value for each movie. This accounts for both very good and very bad movies, as both have a significant impact on audience reception. By multiplying these two factors, we obtain the overall impact a movie has on its audience, which can be compared across films.
\[\textrm{Impact Score}_\textrm{Movies} = \textrm{normalized} (\log(\textrm{number of votes})) \cdot \textrm{abs}(\textrm{normalized}(\textrm{IMDB rating}))\]
According to this metric, those are the top 10 most impactful movies of our dataset:
| Title | Average Rating | Number of Votes | Impact Score |
|---|---|---|---|
| The Shawshank Redemption | 9.3 | 2648879 | 9.90 |
| The Dark Knight | 9.0 | 2620838 | 8.91 |
| Inception | 8.8 | 2322848 | 8.15 |
| Fight Club | 8.8 | 2093849 | 8.05 |
| Forrest Gump | 8.8 | 2051278 | 8.03 |
| Pulp Fiction | 8.9 | 2027513 | 8.33 |
| The Matrix | 8.7 | 1894094 | 7.64 |
| The Lord of the Rings: The Fellowship of the Ring | 8.8 | 1851387 | 7.93 |
| The Godfather | 9.2 | 1836155 | 9.16 |
| The Lord of the Rings: The Return of the King | 9.0 | 1824685 | 8.53 |
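The movie impact score defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which normalization is used, so the sketch assumes a z-score (subtract the mean, divide by the standard deviation). The function names `zscore` and `movie_impact_score` and the toy data are hypothetical and do not come from the project.

```python
import numpy as np


def zscore(values):
    """Standardize values to zero mean and unit variance (assumed normalization)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def movie_impact_score(num_votes, ratings):
    """Normalized log of the vote count times the absolute normalized rating."""
    log_votes = np.log(np.asarray(num_votes, dtype=float))
    return zscore(log_votes) * np.abs(zscore(ratings))


# Toy data (made up): two heavily voted movies with extreme ratings,
# and two rarely voted movies with an average rating.
toy_votes = [100_000, 100_000, 10, 10]
toy_ratings = [2.0, 10.0, 6.0, 6.0]
toy_scores = movie_impact_score(toy_votes, toy_ratings)
```

In this toy example the very bad and the very good movie receive the same score, since only the distance of the rating from the mean matters. Under the z-score assumption, a movie with fewer votes than average gets a negative first factor, so its score is zero or negative.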
Actors, writers and directors
For actors, writers, and directors, we use the Discounted Cumulative Gain to rank the movies they are linked to according to the impact score and compute their overall impact.
\[\textrm{Impact Score}_\textrm{Actors, Directors, Writers} = \sum_{i=1}^{\textrm{number of movies}}\frac{\textrm{movie metric}_i}{\log_2(i + 1)}\]
Here are the top 10 actors, writers and directors with the highest impact score:
| Actor | Impact Score | Director | Impact Score | Writer | Impact Score |
|---|---|---|---|---|---|
| Samuel L. Jackson | 47.28 | Steven Spielberg | 35.52 | Stephen King | 35.70 |
| Robert De Niro | 45.92 | Martin Scorsese | 34.01 | George Lucas | 29.18 |
| Michael Caine | 42.68 | Alfred Hitchcock | 30.92 | Christopher Nolan | 29.14 |
| Morgan Freeman | 42.38 | Christopher Nolan | 29.12 | Bob Kane | 28.51 |
| Al Pacino | 39.39 | Francis Ford Coppola | 27.79 | Quentin Tarantino | 27.30 |
| Bruce Willis | 38.88 | Quentin Tarantino | 26.34 | Francis Ford Coppola | 26.90 |
| Gary Oldman | 37.17 | Akira Kurosawa | 24.82 | Akira Kurosawa | 26.66 |
| Robert Duvall | 36.77 | Stanley Kubrick | 24.71 | David S. Goyer | 25.18 |
| Tom Hanks | 36.71 | Clint Eastwood | 23.37 | Billy Wilder | 24.22 |
| Brad Pitt | 36.55 | Uwe Boll | 22.24 | Hayao Miyazaki | 24.02 |
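The Discounted Cumulative Gain formula above can be illustrated with a short Python sketch. It assumes that a person's movies are ranked from highest to lowest movie impact score before the logarithmic discount is applied, which is the usual DCG convention but is not spelled out in the text. The function name `person_impact_score` and the example values are hypothetical.

```python
import math


def person_impact_score(movie_scores):
    """Sum of a person's movie impact scores, discounted by rank (DCG)."""
    # Assumption: rank 1 is the movie with the highest impact score.
    ranked = sorted(movie_scores, reverse=True)
    return sum(
        score / math.log2(rank + 1)
        for rank, score in enumerate(ranked, start=1)
    )


# Made-up movie impact scores for one person, in arbitrary order.
example_score = person_impact_score([3.0, 1.0, 2.0])
```

Because of the discount, a person's best movies dominate the total, while each additional movie contributes less and less. A long filmography therefore helps, but cannot fully replace a few highly impactful titles.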
We can see that the top 10 actors, writers and directors are all male, which leads to the following question: where are the women?
Where are the Women?
As our project focuses on the representation of women in movies, we start by looking at how the presence of women in movies has evolved over time, both as characters and as part of the crew.
From the graph above, we can see that women in crews have always been even more underrepresented than actresses.
Let’s break the data down by genre, to see if there is any difference in the distribution of women in film across genres.
From the graphs above, we can see that when it comes to genre, women are most often represented in dramas, comedies and romances, while they are underrepresented in action adventure and sci-fi films.
When considering the representation of women among directors and writers, we found that the pattern is similar, although the overall percentage of women in these roles is significantly lower than for actresses, as shown below.